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Summary 

Multiple imputation based on chained equations (MICE) is an alternative missing genotype method that can use genetic 
and nongenetic auxiliary data to inform the imputation process. Previously, MICE was successfully tested on strongly 
linked genetic data. We have now tested it on data of the HBA2 gene which, by the experimental design used in a malaria 
association study in Tanzania, shows a high missing data percentage and is weakly linked with the remaining genetic 
markers in the data set. We constructed different imputation models and studied their performance under different missing 
data conditions. Overall, MICE failed to accurately predict the true genotypes. However, using the best imputation model 
for the data, we obtained unbiased estimates for the genetic effects, and association signals of the HBA2 gene on malaria 
positivity. When the whole data set was analyzed with the same imputation model, the association signal increased from 
0.80 to 2.70 before and after imputation, respectively. Conversely, postimputation estimates for the genetic effects remained 
the same in relation to the complete case analysis but showed increased precision. We argue that these postimputation 
estimates are reasonably unbiased, as a result of a good study design based on matching key socio-environmental factors. 



Keywords: Genotype imputation, multiple imputation based on chained equations, HBA2 gene, malaria positivity 



Introduction 

Missing genotypes are common in genetic association studies 
but often discarded from the analysis. This popular practice 
typically decreases statistical efficiency and power in com- 
parison to an analysis where missing data are taken into ac- 
count conveniently. It may also introduce estimation bias, 
particularly when the missing data pattern is not com- 
pletely random. Since the advent of the HapMap project 
and the decrease in costs of genome-wide association stud- 
ies (GWAS), several imputation methods have been proposed 
to deal with missing genotypes of typed or untyped single- 
nucleotide polymorphisms (SNPs; reviewed in Marchini 
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& Howie, 2010). These imputation approaches aim to re- 
place missing genotypes by plausible guesses, borrowing in- 
formation not only from individuals with observed geno- 
types, but also from the underlying linkage disequilibrium 
(LD) and/or haplotype structure between genetic markers. 
To improve the quality of genotype prediction, one can also 
include available maps of recombination or reference data 
from HapMap in the imputation process. Three popular im- 
putation software packages are BEAGLE (Browning, 2006; 
Browning & Browning, 2009), IMPUTE (Marchini et al, 
2007; Howie et al, 2009), and MACH (Li et al, 2010), 
and their performance has been compared to each other in 
different data settings (Pei et al., 2008; Huang et al, 2009; 
Nothnagel et al, 2009; Chanda et al, 2012; Hancock et al, 
2012). Notwithstanding the good performance of current 
genotype imputation methods in GWAS, they should be used 
with caution in alternative settings where the population un- 
der study is not so closely related to the ones in the HapMap 
database, as demonstrated by Jallow et al. (2009). The same 
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applies to studies when there are weak to moderate LD pat- 
terns between the genetic markers with and without missing 
data, due to either hmitations in the study design or the pres- 
ence of latent population structures. In these situations, it 
seems more reasonable and sensible to apply more general 
and flexible imputation approaches. 

Multiple imputation based on chained equations (MICE) 
has proven to be a usefi.il imputation technique in different 
statistical applications (Clark & Altman, 2003; Ambler et al., 
2007; van Buuren, 2007). This general approach is able to 
"rescue" other data that might otherwise be unusable, as a 
way to inform the imputation process in order to generate 
multiple imputed data sets upon which parameter estimation 
is then carried out. The final step of the analysis is to combine 
these imputed-based estimates and the respective standard er- 
rors in order to capture the underlying imputation uncertainty 
(Rubin, 1996). When first applied to genetic data, MICE pro- 
duced unbiased estimates for genetic efliects even when using 
imputation models not including auxiliary genetic informa- 
tion (Souverein et al., 2006). However, this first application 
was performed on European data that typically shows low ge- 
netic diversity together with long haplotypes and LD blocks 
(Conrad et al., 2006; Campbell & Tishkoff, 2008; Jakobs- 
son et al., 2008). It is unclear whether MICE would have 
similar good performance in African data that often exhibits 
shorter LD patterns and haplotype blocks, due to high genetic 
diversity and/ or population substructure. 

This paper deals with genotype imputation in data from 
an association study (~7000 individuals) of malaria parasite 
positivity in Tanzania (Drakeley et al., manuscript in prepara- 
tion). We focus our study on the African a'^'^-globin deletion, 
which occurs in the HBA2 gene. The corresponding data set 
shows two unconventional characteristics. Firstly, the HBA2 
gene is not in strong LD with any other SNP in the data set. 
Thus, commonly used methods, such as BEAGLE, IMPUTE, 
or MACH, are not applicable to our data. Secondly, as a result 
of diflierent sequencing eflxirts, there is an extremely high per- 
centage of missing data on the a^^'^-globin deletion (~62%) 
in contrast to the low missing genotype percentage observed 
for the remaining SNPs in the data set (<10%). As a conse- 
quence, the subsequent statistical inferences will have different 
precision and power if only the complete case data are used. 
In this setting, MICE would appear to be the most promis- 
ing tool to perform genotype imputation because one can 
inform the imputation process with genetic and nongenetic 
data. However, because of the high fraction of missing data 
on the a'^'^-globin deletion, it is unclear whether this method 
could provide accurate results in terms of genetic association 
assessment, genetic effect estimation, and missing genotype 
prediction. The goal of this study was therefore to learn the 
limits of MICE in imputing missing genotypes in this non- 
standard setting. 



Methods 

Study Population 

The data set under analysis is part of a large cross-sectional 
study performed across 24 villages divided into six altitude 
transects (150—1800 m) in the Kilimanjaro and Tanga regions 
of Tanzania, as described elsewhere (Drakeley et al., 2005). 
The original sample size is about 7000 individuals with age 
between 6 months and 45 years old. Genotyping of the HBA2 
gene was only attempted in a subset of individuals living in 
13 villages from four altitude-based transects (see Table 1). 
Considering individuals living in these 13 villages only, the 
sample size drops to 4414 individuals, for 36% of whom there 
is no information on the number of the a'^'^-globin deletions. 

Phenotype and Genotype Data 

Genotype, clinical and background data were generated as 
part of an ongoing larger project (http://www.malariagen. 
net; Drakeley et al., in preparation). In brief, genetic data refer 
to a total of 165 SNPs predominantly from malaria candidate 
genes (e.g., the sickle cell gene; HbS), of which 110 passed our 
stringent quality control step (minor allele frequency >5%, 
p-value for Hardy- Weinberg equilibrium in malaria-negative 
individuals >0.001, and missing data per SNP or individual 
>10%). A complete list of the SNPs can be found elsewhere 
for a similar study in Mali (Maiga et al., 2013). 

We have also included data on the African a'^^-globin 
deletion that occurs in the HBA2 gene located in the telom- 
eric region of Chromosome 16 (16pl3.3). Experimentally, 
the number of a'^'^-globin deletions per individual genotype 
was determined by polymerase chain reaction assays as de- 
scribed elsewhere (Liu et al., 2000). Briefly, the PCR was 
carried out in a tetrad thermocycler (PTC-0240, The DNA 
engine Tetrad2® Thermal Cycler, Bio-Rad, Hercules, Cal- 
ifornia, USA). DNA samples from individuals with known 
a'^^-globin deletion status (no deletion, one deletion, or two 
deletions) and samples with no DNA template as a nega- 
tive control were included to each set of PCR reactions, as 
controls to assess the status of a^"^-globin deletion in the in- 
dividuals. This was followed by gel electrophoresis in order 
to determine the number of a^"^-globin deletions (0, 1, or 2) 
defining the genotype of an individual. 

For each individual in the study, we also have information 
on ethnicity, gender, age, hemoglobulin (Hb) levels, Plasmod- 
ium falciparum positivity (detected by microscopy inspection 
of blood smears on slides), and parasite density. 

Genetic Association Analysis 

The purpose of the genetic association analysis is to de- 
termine whether the number of the a^^-globin deletions 
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affects Plasmodium falciparum positivity (hereafter referred to 
as malaria parasite positivity). We first performed an unad- 
justed association analysis between the number of a^'^-globin 
deletions and parasite positivity using the Pearson's test 
for two-way contingency tables. We then used logistic re- 
gression to test the genetic association, adjusting for putative 
socio-environniental confounders, such as "ethnicity" (Wa- 
pare, Wasambaa, Wachaga, and others), "age" (in years), 
"transect" (Kilimanjaro, South Pare, West Usambara, and 
Tanga coast) and "gender" (male/female). In this analysis, we 
compared two nested models, one including putative con- 
founders only and another including both conftjunders and 
the genetic effects of the HBA2 locus. In the latter model, we 
considered two independent genetic effects of the HBA2 lo- 
cus {Xi and A.2 ), one associated with one deletion and another 
associated with two deletions. Model comparison was made 
by means of a likelihood ratio test, where a small p-value (e.g., 
<0.05) is indicative of a statistical association between the 
HBA2 locus and malaria parasite positivity. Alternatively, one 
can use —logio (p-value) as a measure of association strength. 
Large values of this quantity suggest a strong association be- 
tween the locus under analysis and the phenotype. 



Analysis of Missing Data 

For each village where a^^-globin genotyping was actually 
attempted, we tested whether missing genotypes were com- 
pletely at random (MCAR) assuming a missing at random 
(MAR) mechanism as the saturated model for the data. This 
analysis was performed using the ACD package for the R 
software (Poleto et al, 2014). 

We then used MICE to impute missing data under different 
scenarios and imputation models based on three-category lo- 
gistic regression under a Multinomial sampling scheme (Sou- 
verein et al., 2006). In particular, we used the following lo- 
gistic regression: 



log^^ = A. + I]A^', 

^0 rr 



and 



1 ^2 
log — = 



where |j/ is the probability of an individual having i a'^'^-globin 
deletions, x, is the (th among m covariates and, P, and fi* are 
the corresponding main effects. To construct the imputation 
model, we assessed the association between genetic and non- 
genetic variables and the number of a^"^-globin deletions, 
using the Pearson's test for two-way contingency tables or 
the likelihood ratio test for quantitative variables using poly- 
tonious logistic regression. MICE is based on the generation 
of imputed data sets by the following algorithm: (1) generate 
random initial values for the missing genotypes, (2) estimate 
the above logistic regression model, (3) generate new random 
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values for the original missing genotypes using the genotype 
probabilities predicted by the fitted model, (4) repeat steps 2 
and 3 until convergence (of the chain), (5) estimate the cor- 
responding genetic association model using the final iteration 
of the generated chain. To generate each imputed data set, 
we repeated the above algorithm with 25 or 100 iterations 
in order to have stable predictions for the missing data. For 
each (real or generated) data set, we generated 25 or 100 im- 
puted data sets to capture the whole uncertainty underlying 
the imputation process. 

After performing the association analysis in each imputed 
data set, one needs to combine the respective results into final 
estimates for the parameters of interest (Rubin, 1996). To this 
end, the genetic effects X i and A? were estimated by the mean 
of the respective postimputation estimates, that is, 

k 

Xi = Xij/k, i — 1 and 2 
./=i 

where A.,y is the estimate of A, using the j-th imputed data set. 
The associated standard errors were given by 

= V ^ — + — " —k^i — ■ 

for a sufficiently large number of imputed data sets, the 
combined estiniatesA; foUow a Gaussian distribution approx- 
imately. 

Simulation Study 

We carried out a simulation study in order to assess the per- 
formance of MICE on our data. We first adopted the standard 
approach of using the complete case data and treating a frac- 
tion of the existing genotypes as if they were missing, and 
attempted to impute them. We generated the missing geno- 
types independently of true Q!"'"^-globin genotype (MCAR 
assumption). We generated 100 different random data sets 
with missing genotypes using three missing genotype propor- 
tions, 0.10, 0.25, and 0.50. Using the complete case data set 
again, we addressed a second situation where all genotypes 
were assumed missing from a given village. 

To measure the performance of MICE using different im- 
putation models, we compared different parametric infer- 
ences obtained from the complete case analysis (CCA) to 
those obtained after imputation. In particular, we calculated: 
(i) a pseudo estimation bias for the postimputation estimates 
of the average number of a^^^-globin deletions and the ge- 
netic effects for the phenotype-genotype association, (ii) a 
"pseudo coverage" of the postimputation 95% confidence in- 
tervals for the genetic effects, and (iii) the genotype error 
rate and r^ statistic to assess genotype imputation accuracy. 



We defined a "pseudo estimation bias" as the average differ- 
ence between postimputation estimates and those obtained 
from CCA, while "pseudo coverage" was determined by the 
proportion of postimputation 95% confidence intervals that 
contained the genetic effects estimated from CCA. Geno- 
type error rate was determined by the percentage of imputed 
genotypes in disagreement with the true ones. To estimate 
the average number of a^^-globin deletions per individual, 
we used the following formula 

2 

G = ^ i X /, = /i + 2 /2 
;=o 

where fo, fi, and^ are the proportions of an individual having 
0, 1, and 2 a^^'^-globin deletions, respectively. In the case of 
the r^ statistic, we calculated the square of the correlation 
coefficient between the imputed and the true numbers of 
a^^-globin deletions in the individuals with missing data, as 
done elsewhere (e.g., Li et al., 2011). 

Statistical Software 

All statistical analyses (genetic association, imputation, and 
simulation) were carried out using the R software package 
(version 2.15). The corresponding scripts are available from 
the first author upon request. 

Results 

Tanzanian Villages Are Heterogeneous in Terms 
of Background, Clinical, and a^'^-Globin 
Deletion Data 

Sample size, age, and gender are reasonably matched for the 13 
villages where a^"^-globin genotyping was actually attempted 
(Table 1). Altitude varies from 1702 ni in Mokala (Kiliman- 
jaro) to 196 m in the coastal village of Mgome. In each tran- 
sect, a specific ethnic group predominates: Wachaga in Kil- 
imanjaro (with the exception of Kileo where Wapare is the 
main ethnic group), Wapare in South Pare, Wasambaa in West 
Usambara, and a mixed-ethnicity group in the Tanga coast. In 
the latter, the mixed-ethnicity group includes individuals of 
different ethnic groups but not the ones predominantly found 
in the other transects. 

Villages are heterogeneous in terms of malaria parasite 
positivity and Hb levels. The overall malaria parasite posi- 
tivity is 15.6%, ranging from 1.7% in the high-altitude vil- 
lage of Machame Aleni to 48.9% in the lowland of Mgome. 
The overall prevalence of mild anemia is 29.2% but varies 
considerably from village to village, where Machame Aleni 
and Mgila show the lowest and highest prevalence, 9.0% and 
51.0%, respectively. 
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Table 2 Association analysis between the number of a^^'-globin 
deletions and malaria parasite positivity using complete case data. 
"Log-likelihood" refers to the maximum value of the log-likelihood 
function after maximum likelihood parameter estimation. Associa- 
tion analysis was performed adjusting or not for putative confounders 
(age, gender, ethnicity, altitude, and transect) . Association signals re- 
fer to — logio(p-value), where p-value is from Pearson's test for 
two-way contingency tables in the unadjusted analysis, and from the 
hkelihood ratio test for lack of genetic association in the adjusted 
analysis. 





Unadjuste 


d analysis 


Adjusted analysis 


Parameter 


Estimate 


SE 


Estimate 


SE 


Average no. of deletions 


0.337 


0.010 








0.408 


0.121 


-0.225 


0.138 




1.165 


0.250 


0.179 


0.283 


Log-likelihood 


-1069.16 




-876.99 




Association signal 


5.71 




0.80 





With respect to the number of a^^-globin deletions per 
individual, the corresponding distribution differs among vil- 
lages. The high-altitude villages of Mokala, Machame Aleni, 
Ikuini, and Bwambo show percentages of zero a'^'^-globin 
deletions higher than 80%. On the other hand, 66% of the 
individuals from Mgonie have at least one a^^-globin dele- 
tion. The amount of missing data on a^^'^-globin deletions 
varies with villages due to different genotyping efforts, rang- 
ing from 7 out of 236 individuals (3.0%) in Mgome to 196 
out of 378 individuals (51.9%) in Mokala. The UBA2 locus 
seems in Hardy- Weinberg equilibrium in most villages with 
the exception of Kadando, Tamota, Mgila, and Mgome where 
malaria tends to exert a higher selective pressure (Table 1). 

Association Analysis Using Complete Case Data 

We started our association study by analyzing data only from 
individuals with no missing genotypes. The respective sample 
size decreased from 4414 to 2826 individuals. In this restricted 
data set, the average number of a'^'^-globin deletions is around 
0.337 (95% confidence interval: 0.317-0.357; Table 2). Un- 
adjusted association analysis provided evidence for a strong 
effect of the a''"^-globin deletions on malaria parasite posi- 
tivity (association signal: —logio (p-value) = 5.71). However, 
after adjusting for putative confounders, the association sig- 
nal dropped to 0.80 (Table 2). This analysis suggests that the 
number of a^^-globin deletions is not associated with malaria 
parasite positivity, a result in opposition to the strong associ- 
ation signals for severe malarial anemia in Tanzanian children 
(Manjurano et al., 2012) and clinical phenotypes in Kenya 
(WiUiams, Wambua et al, 2005). 



Missing Genotypes Are Generated by Different 
Random Mechanisms across Villages 

Before performing data imputation per se, we first tested 
whether missing data from each village would foUow either 
an MCAR or an MAR mechanism (Fig. lA). Overall, there 
are 7 of 13 villages where an MCAR mechanism holds at 
the 5% significance level. Missing data from the remaining 
villages are instead compatible with an MAR mechanism. 
Curiously, an MAR mechanism holds on data from the four 
villages in the West Usambara transect, whereas an MCAR 
is a reasonable assumption for data from villages in the South 
Pare transect. A mixture of MCAR and MAR mechanisms 
is found for data of villages from the Kilimanjaro transect. 
Therefore, different efforts in genotyping the HBA2 gene 
across study sites resulted in distinct missing data mechanisms 
for the data. 

Building the Imputation Models for MICE 

Motivated by the high frequency of missing data for the 
a^-^-globin deletions (1588 out of 4414), we studied the 
performance of MICE in estimating different parameters of 
interest. Using the subset of 2826 individuals with complete 
information on a'^'^-globin deletion status, we conducted 
a preliminary association analysis between that variable and 
the remaining data in order to identify the most informative 
covariates for the imputation process. In this analysis, we had 
a total of 110 SNPs that passed our stringent quality control 
step, five phenotypes measured in the individuals, and five 
socioenvironmental factors (Fig. IB). There are only four 
SNPs strongly associated with the number of a'^'^-globin 
deletions: rsl800629 {TNFa, chr. 6), rs3211938 {CD36, 
chr. 7), rs334 {HbS, chr. 11), and rs542998 {RTN3, chr. 
11). The association between HbS and a^^-globin deletions 
was previously detected in the same data set by observing a 
higher frequency of heterozygous individuals in both genes 
than that expected under independent segregation (Enevold 
et al., 2007). In contrast, a weak LD between these two loci 
was previously detected in the neighboring country Kenya 
(Williams, Mwangi et al., 2005). In addition to these four 
SNPs, missing data may also be informed from the following 
covariates: Hb levels, mild anemia and malaria parasite 
positivity, and transect. These eight data variables were then 
used to construct three imputation models with the following 
covariates: (i) rsl800629, rs3211938, rs334, and rs542998 
(model IMi); (ii) Hb levels, mild anemia positivity, malaria 
parasite positivity, and transect (model IM2), and (iii) all 
covariates included in previous models (model IM3). We 
studied the performance of each model and compared them 
to the simple multiple imputation procedure based on the 
observed frequency of a^"^-globin deletions (model IMq). 
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Figure 1 (A) Testing a missing completely at random (MCAR) hypothesis under the basic assumption of the missing at 
random (MAR) model for the data resulting from the cross-tabulation of the number of a^^-globin deletions with malaria 
parasite positivity. Each dot represents the p-value for the corresponding likelihood ratio test. Horizontal pointed line refers to 
the 5% significance level. In this analysis, we accepted the MCAR hypothesis on data from villages where p-value >0.05. The 
rejection of MCAR led to the acceptance of an MAR mechanism. (B) Association analysis between or^^'-globin deletions and 
different variables (phenotypes — at the left, SNPs — at the centre, and socioenvironmental factors — at the right) using complete 
data. Association signal is expressed in terms of — logio(p-value) for the corresponding association test: test for categorical 
explanatory variables (SNPs, low Hb, anemia, parasite positivity, gender, transect, vOlage, and ethnicity) and score tests for 
quantitative explanatory variables (Hb levels, parasite density, age, and altitude) using a three-category logistic regression 
framework. Horizontal dashed line refers to — logio(O.OOl) corresponding to a 0.1% significance level. 
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Imputation Models IM2 and IM3 Led to the 
Smallest Decrease in the Association Signals 
When Missing Data Are Completely at Random 

We first studied the performance of MICE simulating missing 
data under an MCAR mechanism. The estimated genotype 
error rates are ~44% irrespective of the imputation model 
used, including IMo, and the amount of missing genotypes 
in the data set (Table 3). In line with this result, we also 
found a low value of the r^ statistic on average (r^ < 0.10; 
Table SI). Therefore, in the absence of SNPs in strong LD 
with the HBA2 gene, MICE has limited power in predicting 
the missing number of a^^-globin deletions with great preci- 
sion. Notwithstanding this limitation, all imputation models 
could provide unbiased estimates for the average number of 
a^^'^-globin deletions. The range of these estimates increased 
with the missing data proportion. 

With respect to the association signals, the different imputa- 
tion models led to an overall decrease in the maximum value of 
the log-likelihood, irrespective of the missing data proportion 
(Table 4). This decrease might be explained by the absence 
of additional covariates from the imputation models. A direct 
consequence of this result is a weakening of the association 
signal in relation to that of the CCA. IM2 and IM3 led to 
little estimation bias (0.11%— 0.13%) of the maximum of the 
CCA log-likelihood function. Although producing genetic 
effect estimates with least bias, IMo and IMi led to the largest 
estimation bias (~1%) of the maximum of the CCA log- 
likelihood function. Finally, every imputation model could 
generate confidence intervals that all included the genetic ef- 
fect estimates obtained from CCA. This result indicates that 
there is a high degree of uncertainty underlying the genetic 
effect estimation. 

MICE May Show Estimation Bias If All Missing 
Genotypes Were Assumed to Come from the 
Same Study Site 

We then studied the performance of MICE in a scenario 
where the genotyping of the HBA2 gene was not attempted 
in individuals from a given village. The genotype error rate 
varies with imputation models and villages with missing geno- 
types (Table 3). The lowest genotype error rate was ob- 
tained from both IM2 and IM3 (-40%; 44.3% - IM,,, 43.7% 

— IMi). Similar qualitative conclusion can be taken from the 
r^ statistic (Table SI). In these two models, the lowest and 
highest genotype errors were obtained from data of Mokala 
in the Kihnianjaro transect (genotype error rates: 26.8% — 
IM2, 26.5% - IM3) and Mgonie in the Tanga coast (56.6% 

— IM2, 55.2% — IM3), respectively. Conversely, the highest 
value for the r^ statistic was found for Kileo data (0.015 — 



IM2 and 0.017 - IM3) with the exception of the IM2 for the 
Mokala data (r" = 0.20). Curiously, each imputation model 
could perform better in missing data from villages in the Kil- 
imanjaro transect than in the others. The reason for this is 
unclear but it might be related to a stronger association of 
the a'^'^-globin deletions with the phenotypes and/or socio- 
environmental factors in this specific transect. On the other 
hand, all imputation models produced a very high genotype 
error rate for the missing data from Mgome (>55%). This 
result can be explained by a different selective pressure in this 
village due to a higher malaria exposure (Table 1). 

Using IMi, IM2, and IM3, there is an inverse correlation 
between altitude and genotype error rates. Missing genotypes 
tend to be better predicted in high-altitude villages than in 
those at the lowlands of the respective transects. An example of 
this is the West Usambara transect where the genotype error 
rate varies from 31.8% in Bwambo, a high-altitude village, 
to 51.6% in Kadando in the lowlands of that transect. This 
result is in line with an increasing number of zero a^'^-globin 
deletions from the lowlands to high-altitude sites where the 
malaria positivity rates are much lower (Table 1). 

With respect to the average number of a'^'^-globin dele- 
tions, it is difficult to ascertain the imputation model with the 
best performance (Table 3). The corresponding estimates can 
be biased or unbiased, depending on the source of the missing 
data. On the one hand, one can obtain biased estimates when 
missing data come from Machame Aleiii, even if the under- 
lying genotype error rate is reasonably low (26.8%, IM3). On 
the other hand, irrespective of the imputation model used, lit- 
tle bias was obtained when missing data were from villages in 
the West Usambara transect. Imputing genotypes of Mgonie, 
although leading to a high genotype error rate, did not lead 
to an extremely high estimation bias. 

The simple imputation procedure based on genotype pro- 
portions (IMo) led to approximately unbiased association sig- 
nals and genetic effect estimates, irrespective of the missing 
data under analysis (Table 4). In contrast, IMi impHed an 
overall weakening of the association signals by decreasing the 
maximum of the CCA log-likelihood function by 0.2%. The 
corresponding genetic effect estimates are approximately un- 
biased with the exception of A i when missing data came from 
Tamota. Both IM2 and IM3 tend to strengthen the association 
signals by increasing the maximum of the CCA log-likelihood 
function. The highest increase in the association signal is ob- 
served for IM2 when missing data came from Tamota (~1.1% 
inflation of the maximum of the CCA log-likelihood func- 
tion). The corresponding genetic effect estimates are in most 
cases unbiased with a few exceptions. IM2 seems to lead to 
biased estimates for Xx when missing data came from Mokala 
(Kilimanjaro), Mpinji (South Pare), Tamota (West Usambara), 
or Mgome (Tanga coast). Conversely, IM3 tends to produce 
biased estimates of A.2. 
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Imputation of Truly Missing Data from Lowland 
Villages Significantly Increased the Association 
Signals 

We finally performed genotype imputation on truly miss- 
ing data and the corresponding association analysis. Table S2 
shows the background information for the 1 1 villages where 
the genotyping of the HBA2 locus was not actually at- 
tempted. In brief, these data seem to be fairly matched for key 
socio-environmental factors and, more importantly, we can 
still find an inverse correlation between malaria parasite pos- 
itivity and altitude. 

We performed three different association analysis: (i) im- 
putation of missing data from the 13 villages studied thus far 
{scenario 1), (ii) imputation of missing data from the same 13 
villages and an additional village where genotyping of the 
HBA2 gene was not attempted (scenario 2), and (iii) impu- 
tation of missing data from the 24 villages included in the 
original study design {scenario 3). According to our simulation 
study, we considered IM3 as the best imputation model for 
the data because it appears to produce a good compromise be- 
tween estimation and association bias. The remaining results 
for IM|), IMi, and IM2 can be found in Table S3. 

In the case of scenario 1, the average number of a^'''- 
globin deletions decreased from 0.337 (Table 2; CCA) to 
0.326 (Table 5; after imputation). In turn, the association 
signal increased from 0.80 (Table 2) to 1.47 (Table 5). The 
estimates of the underlying genetic effects are similar but the 
respective standard errors are now smaller {SE{Xi) — 0.124 
and SE{X2) — 0.271 after imputation vs. SE{Xi) — 0.138 and 
SE{X2) — 0.283 in CCA). Therefore, a stronger association 
signal seems a direct effect of increasing the sample size. 

With respect to scenario 2, the overall missing genotype 
percentage varies from 37.2% (Ubiri) to 40.0% (Ngulu). The 
estimates of the average number of a'^^-globin deletions are 
similar to the ones from previous imputation scenario, irre- 
spective of the village added into the analysis with the excep- 
tion of Kwemasimba (average number of a'^'^-globin dele- 
tions = 0.337). The mean association signals, ranging from 
1.48 (Kilonieni) to 1.88 (Tewe), are inversely correlated with 
altitude across villages of the same transect. Genetic effect es- 
timates range from —0.246 (Ubiri) to —0.216 (Handei) fori/ 
and from 0.165 (Ubiri) to 0.208 (Handei) for X2, respectively. 

When we analysed the whole data set of the 24 villages 
(n = 7048) where missing genotypes is close to 62%, the 
average number of a^^'^-globin deletions was estimated to 
be 0.336 after imputation (Table 5), a value in line with 
that from CCA but associated with a higher standard error 
{SE — 0.010 and 0.034 before and after imputation, respec- 
tively). The post imputation association signals span more 
than six orders of magnitude (from 0.10 to 6.75) with an 
average of 2.70. The genetic effect estimates for Xj and X2 



are -0.232 {SE{Xt) = 0.122) and 0.183 {SE{X2) = 0.252), 
respectively. These estimates, though associated with smaller 
standard errors, do not differ substantially from those obtained 
from CCA. This result suggests that the imputation is captur- 
ing the same genetic effects produced by CCA but increasing 
estimation precision due to a larger number of individuals 
included in the analysis. 

Discussion 

This study concerns MICE and its utility in imputing miss- 
ing data from Tanzania on the a'^^-globin deletion. We 
showed that, when missing genotypes were simulated under 
an MCAR assumption, MICE using model IM3 led to geno- 
typic error rates of ~44% and low r" statistics but unbiased 
estimates for the different genetic parameters and association 
signals. We then conclude that MICE is useless in predicting 
the true status of the HBA2 gene of a particular individual but 
is rather useful in providing reliable genetic effect estimates 
and association signals. In theory, we expect that MICE would 
show better genotype prediction accuracy if we had in our 
data set any genetic marker in strong LD with the HBA2 
gene. In line with this expectation is the study of Souverein 
et al. (2006) where MICE, when applied to linked genetic 
marker data, led to genotype error rates <20%. However, it 
is still unclear whether MICE using an appropriate imputa- 
tion model is capable of decreasing the genotype error rates 
to figures close to those observed in GWAS (~5%— 6% on 
average; Marchini & Howie, 2010), thus allowing the perfor- 
mance of accurate haplotype analysis. Previous studies have 
shown that the genotype prediction accuracy is dependent on 
the number of strongly linked SNPs to the one with miss- 
ing information, the amount of missing data, the population 
under analysis, and/or reference panel, if any is used to in- 
form the imputation process (Pei et al., 2008; Huang et al., 
2009; Nothnagel et al., 2009; Chanda et al, 2012; Hancock 
et al., 2012). Additionally, genotype error rates in MICE are 
dependent on the imputation model used for the analysis, as 
demonstrated in our simulation study. Since MICE has the 
possibility of including genetic and nongenetic information in 
the imputation process, we do not expect any theoretical im- 
pediment of this framework in predicting missing genotypes 
more accurately. 

We also assessed the performance of MICE assuming that 
the genotyping of the HBA2 gene was not attempted in indi- 
viduals from a given village. Since different combinations of 
socio-environmental factors characterize the villages in this 
study, the resulting missing data might be dependent on spe- 
cific characteristics of the individuals of these villages, thus 
leading to a putative missing not at random (MNAR) mecha- 
nism. In this case, MICE using model IM3 can lead to biased 
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Table 5 Genetic association analysis using IM3 (100 imputed data sets) under different data settings: (i) data of 13 villages where genotyping 
of the HBA2 gene was attempted in the majority of the individuals, (ii) data of the same 13 villages and an additional village where geno- 
typing was not attempted, and (iii) all data from the 24 villages. 

Estimates (SE) 

Total sample Missing Mean association Average no. 



Analysis 


size, n 


genotypes, 


% signal (range)" 


of deletions 




>^2 


13 villages'' 


4143 


34.9 


1.47 (0.42-3.22) 


0.326 (0.01) 


-0.226 (0.12) 


0.181 (0.27) 


1 3 villages and an additional viUag 


e'' 












North Pare 














Kilomeni 


4433 


39.1 


1.47 (0.19-3.68) 


0.32 (0.01) 


-0.22 (0.13) 


0.19 (0.270) 


1 0 m r\ 


4405 


38.7 


1 59 fO 1 8-4 28'> 


0 3? fO 0 1 "1 


—0 ?4 ('0 1 \\ 


0 1 8 I'O ?T\ 


Ngulu 


4499 


40.0 


1.62 (0.29-4.75) 


0.33 (0.01) 


-0.23 (0.13) 


0.17 (0.28) 


Kambi ya Simba 


4356 


38.0 


1.65 (0.13-4.42) 


0.33 (0.01) 


-0.23 (0.13) 


0.20 (0.26) 


West Usambara 1 














Emmao 


4321 


37.5 


1.51 (0.20-3.98) 


0.33 (0.01) 


-0.23 (0.13) 


0.17 (0.27) 


Handel 


4489 


39.9 


1.67 (0.38-4.87) 


0.33 (0.01) 


-0.23 (0.12) 


0.21 (0.27) 


Tewe 


4463 


39.5 


1.88 (0.21-5.21) 


0.33 (0.01) 


-0.23 (0.13) 


0.20 (0.27) 


Mn'galo 


4490 


39.9 


1.75 (0.27-4.56) 


0.33 (0.01) 


-0.22 (0.13) 


0.21 (0.27) 


West Usambara 2 














Magamba 


4351 


38.0 


1.56 (0.59-3.51) 


0.32 (0.01) 


-0.23 (0.12) 


0.20 (0.27) 


Ubiri 


4297 


37.2 


1.77 (0.07-5.08) 


0.33 (0.01) 


-0.25 (0.13) 


0.17 (0.27) 


Kwemasimba 


4374 


38.3 


1.83 (0.21-4.63) 


0.34 (0.01) 


-0.24 (0.13) 


0.18 (0.27) 


24 villages'^ 


7048 


61.7 


2.70 (0.10-6.75) 


0.34 (0.03) 


-0.23 (0.12) 


0.18 (0.25) 


"Association signal is calculated by 


-logifj(p-value) 


using either the 


mean or the median of log 


-likelihood ratio statistic across all 


imputed data 



sets. 

''Results of model IM3 were obtained from imputed data using chains of 25 iterations and random initial conditions. 
^Results of model IM3 were obtained from imputed data using chains of 100 iterations and random initial conditions. 



or unbiased genetic effect estimates and association signals, 
depending on the villages where missing data comes from. 
The source of this bias is not totally clear but it may reflect 
the "strength" of an MNAR mechanism for the data. In fact, 
our results showed little bias while analyzing missing data from 
villages in intermediate altitudes because imputation can bor- 
row information from other villages characterized by similar 
genetic and socio-environmental factors. Conversely, there is 
a significant bias when missing data comes from the coastal vil- 
lage of Mgome, which exhibits unique characteristics among 
aU the 13 villages used in our simulation study. Thus, MICE 
under a putative MNAR mechanism should be used with 
caution, as also alerted by Souverein et al. (2006). 

The possibility of producing biased estimates when missing 
data comes from a specific village poses the question about 
the reliability and robustness of the results for the whole data 
set where there is no 0!''"^-globin data for entire transects 
(North Pare and West Usambara 1 and 2). In this case, it is 
expected that missing genotypes are the result of a mixture 
between MCAR, MAR, and MNAR mechanisms. MAR or 
MCAR might arise from data of villages where genotyping 
of the HBA2 gene was actually attempted, while MNAR 
might occur in data from villages where genotyping of that 
gene was not attempted. In this overall analysis, the associ- 



ation signals of the a'^'^-globin deletions became statistically 
significant, most likely due to an increased number of indi- 
viduals under the analysis. If true, the use of MICE, besides 
increasing estimation precision, seems invaluable in detect- 
ing additional association signals from known and unknown 
genes with smaller effect that otherwise would remain un- 
detected. This is particularly relevant in association studies of 
nonclinical malaria phenotypes where different genes with 
only moderate effects could be detected, as demonstrated by 
recent data from Mali (Maiga et al., 2013). 

We believe that, under the realistic assumption of similar ge- 
netic pressure across transects, our results for the whole data set 
show no strong bias as a result of our weU-balanced study de- 
sign. The use of transects and their relationship with the high 
prevalence of specific ethnic groups was invaluable in control- 
hng putative genetic substructures in the study area. A design 
using altitude-based transects has ensured the comparability 
of the villages in terms of malaria parasite exposure. The addi- 
tional matching for gender and age has minimized further se- 
lective bias. The key to minimal estimation bias after data im- 
putation therefore relies on having an adequate study design. 

In conclusion, MICE using an appropriate imputation 
model shows limitations in predicting the missing data of the 
a^"^-globin deletions of a given individual, but seems to lead 
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to the correct genetic effect estimates and association signals as 
long as the data are weU-matched for key socio-environniental 
factors. 
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Supporting Information 

Additional Supporting Information may be found in the on- 
line version of this article: 

Table SI Mean genotype accuracy (range in brackets) us- 
ing the squared correlation between the imputed and true 
number of a'^'^-globin deletions, where IMo refers to imputa- 
tion carried out using the observed frequencies of a'^'^-globin 
deletions, IMi includes four SNPs as imputation covariates 
(rsl800629, rs3211938, rs334, and rs542998), IM2 includes 
eight phenotypes and socioenvironmental factors (Hb, mild 
anemia, malaria parasite positivity, transect, altitude, and eth- 
nicity), and M3 includes all variables in Mi and M2. 

Table S2 Background information of the 11 villages where 
a^'^-globin genotyping was not attempted. 

Table S3 Association analysis after carrying out genotype 
imputation by means of IMq and IMi using data from: (i) the 
13 villages where a''"^-globin genotyping was attempted in 
the majority of the individuals, (ii) the same 13 villages and 
an additional village where a^"^-globin genotyping was not 
attempted, and (iii) all the 24 villages. 
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